Background

The Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze customer data, identify the customers who are likely to leave its credit card services, and understand the reasons why, so that it can improve in those areas.

We need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.

We need to identify the best possible model that will give the required performance

Objective

Data Set

Load and overview the dataset

Check the percentage of missing values in each column
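A quick way to do this with pandas; the tiny frame below stands in for the dataset loaded above (illustrative values only):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the bank churn data; in the notebook, df comes from the CSV load above
df = pd.DataFrame({
    "Customer_Age": [45, 49, np.nan, 40],
    "Education_Level": ["Graduate", None, "High School", "Graduate"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0],
})

# Percentage of missing values per column, highest first
missing_pct = df.isnull().mean().mul(100).round(2)
print(missing_pct.sort_values(ascending=False))
```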

Let's check the number of unique values in each column

Summary of the data

Observations

Check for null values

Let's check the count of each unique category in each of the categorical variables.

Replace 'abc' values in Income_Category with NaN so we can impute these values later
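A sketch of that replacement using `pandas.Series.replace` (the values below are toy data; the real column comes from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income_Category": ["$60K - $80K", "abc", "Less than $40K", "abc"]})

# Mark the 'abc' placeholder as missing so the imputation step can handle it later
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)
n_missing = df["Income_Category"].isnull().sum()
```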

Observations

EDA

Univariate Analysis

Visualize Customer Age

Visualize Dependent_count

Visualize Months_on_book

Visualize Total_Relationship_Count

Visualize Months_Inactive_12_mon

Visualize Contacts_Count_12_mon

Visualize Credit_Limit

Visualize Total_Revolving_Bal

Visualize Avg_Open_To_Buy

Visualize Total_Amt_Chng_Q4_Q1

Visualize Total_Trans_Amt

Visualize Total_Trans_Ct

Visualize Total_Ct_Chng_Q4_Q1

Visualize Avg_Utilization_Ratio

Outlier Detection
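One common approach is the 1.5 × IQR rule, sketched here on a toy series (the notebook's boxplots convey the same idea visually):

```python
import pandas as pd

s = pd.Series([2, 3, 3, 4, 5, 50])  # 50 is an obvious outlier

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
```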

Observations:

Let's define a function to create bar plots for the categorical variables, indicating the percentage of each category for that variable.
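One possible implementation with matplotlib; the function name `perc_barplot` and the styling are illustrative, not the notebook's actual code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def perc_barplot(df, col):
    """Bar plot of a categorical column, annotated with each category's percentage."""
    pct = df[col].value_counts(normalize=True).mul(100).round(1)
    ax = pct.plot(kind="bar")
    for i, v in enumerate(pct):
        ax.annotate(f"{v}%", (i, v), ha="center", va="bottom")
    ax.set_ylabel("Percentage")
    ax.set_title(col)
    plt.tight_layout()
    return pct

pct = perc_barplot(pd.DataFrame({"Gender": ["F", "F", "F", "M"]}), "Gender")
```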

Visualize Attrition_Flag

Visualize Gender

Visualize Education_Level

Visualize Marital_Status

Visualize Income_Category

Visualize Card_Category

Observations:

Bivariate Analysis

Heatmap
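A correlation heatmap can be drawn with plain matplotlib as below (seaborn's `heatmap` is a common alternative); the columns here are a small illustrative subset:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy numeric columns; the notebook uses all numeric features of the dataset
df = pd.DataFrame({
    "Credit_Limit": [1000.0, 2000.0, 3000.0, 4000.0],
    "Avg_Open_To_Buy": [900.0, 1850.0, 2950.0, 3900.0],
    "Total_Trans_Ct": [10, 40, 20, 30],
})
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.tight_layout()
```

Near-perfect correlation between Credit_Limit and Avg_Open_To_Buy, as in the real data, shows up as a deep red cell.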

Observations:

Plot Attrition (Dependent Variable) vs. Independent Variables:

Observations:

Plot categorical variables vs. Dependent Variable (Attrition)

Define function to plot

Attrition Flag vs Gender

Attrition Flag vs Education_Level

Attrition Flag vs Marital_Status

Attrition Flag vs Income_Category

Attrition Flag vs Card_Category

Observations:

Data Preparation for Modeling

Encoding the Output Variable (Attrition_Flag)
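A minimal sketch, assuming the label values are "Existing Customer" and "Attrited Customer" as in the standard version of this dataset; the churn class maps to 1 since it is the class we want to catch:

```python
import pandas as pd

df = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"]})

# 1 = attrited (churned), 0 = existing; the positive class is the one of interest
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
```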

Split the dataset into train and test sets

Note: We will use X_train and X_test in models where we do not use cross-validation, and X_train and X_val in models where we do use cross-validation.
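A stratified two-stage split along these lines (the split proportions and toy arrays are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)  # imbalanced target, like the churn flag

# Stratify so the churn ratio is preserved in every split
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)
```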

Missing Value Imputation

As we saw earlier, our data has missing values. We will impute missing values using mode since all missing values are for categorical variables. We will use SimpleImputer to do this.
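A sketch with `SimpleImputer(strategy="most_frequent")`; note that the imputer is fit on the training split only and then applied to the other splits:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Education_Level": ["Graduate", np.nan, "Graduate", "High School"]})
test = pd.DataFrame({"Education_Level": [np.nan, "Doctorate"]})

imputer = SimpleImputer(strategy="most_frequent")
train[["Education_Level"]] = imputer.fit_transform(train[["Education_Level"]])
# Reuse the mode learned from the training data; never fit on the test set
test[["Education_Level"]] = imputer.transform(test[["Education_Level"]])
```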

Validate the Data Post Missing Value Imputation :

Dummy Variable Creation :

Let's create dummy variables for string type variables and convert other column types back to float.
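A sketch with `pd.get_dummies` on a couple of illustrative columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Card_Category": ["Blue", "Silver", "Blue"],
    "Credit_Limit": [1000, 2000, 1500],
})

# One-hot encode the string columns, dropping one level to avoid redundancy
df = pd.get_dummies(df, columns=["Gender", "Card_Category"], drop_first=True)
df = df.astype(float)  # convert everything (including the new 0/1 dummies) to float
```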

Encode/rename the data from categorical variables

- Income_Category
- Education_Level
- Card_Category

Building the model

Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting a customer will leave the bank and the customer doesn't leave - Loss of resources
  2. Predicting a customer will not leave the bank and the customer leaves - Loss of opportunity

Which case is more important?

How do we reduce this loss, i.e., reduce False Negatives?

Let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.
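One possible shape for these helpers (the function names and the exact metric set are illustrative):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(model, X, y):
    """Return the main classification metrics for a fitted model."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred),       # key metric: fraction of churners caught
        "Precision": precision_score(y, pred),
        "F1": f1_score(y, pred),
    }

def make_confusion_matrix(model, X, y):
    """Confusion matrix as a 2x2 array: rows = actual, columns = predicted."""
    return confusion_matrix(y, model.predict(X))

# Quick demonstration on a trivially separable toy problem
from sklearn.tree import DecisionTreeClassifier
X_toy = [[0], [1], [0], [1]]
y_toy = [0, 1, 0, 1]
tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
scores = model_performance(tree, X_toy, y_toy)
```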

Model 1 : Logistic Regression (m1_lg)

Model 2 : Decision Tree Classifier (m2_d_tree)

Model 3 : Bagging Classifier with Decision Tree Base Estimator (m3_bagging_estimator)

Model 4 : Random Forest Classifier (m4_rf_estimator)

Model 5 : AdaBoost Classifier (m5_adaBoost_classfr)

Model 6 : Gradient Boosting Classifier (m6_gradBoost_classfr)

Training Performance Comparison: All Classifier Models Above, Without Sampling

Testing Performance Comparison: All Classifier Models Above, Without Sampling

Insights :

Two best models among the first six: Gradient Boosting Classifier and AdaBoost Classifier

Model building - Oversampled data

Upsampling : SMOTE

Model 7 : Logistic Regression - Oversampled Using SMOTE (m7_lg)

Model 8 : Decision Tree Classifier - Oversampled Using SMOTE (m8_d_tree)

Model 9 : Decision Tree Bagging Classifier - Oversampled Using SMOTE (m9_bagging_estimator)

Model 10 : Random Forest Classifier - Oversampled Using SMOTE (m10_rf_estimator)

Model 11 : AdaBoost Classifier - Oversampled Using SMOTE (m11_adaBoost_classfr)

Model 12 : Gradient Boosting Classifier - Oversampled Using SMOTE (m12_gradBoost_classfr)

Training Performance Comparison: Oversampling Using SMOTE

Testing Performance Comparison: Oversampling Using SMOTE

Insights (Models with Oversampled data)

Two best models with oversampled data: Gradient Boosting Classifier and AdaBoost Classifier

Model Building: Undersampled Data (ClusterCentroids)

Downsampling : ClusterCentroids

Model 13 : Logistic Regression - Undersampled Using ClusterCentroids (m13_lg)

Model 14 : Decision Tree Classifier - Undersampled Using ClusterCentroids (m14_d_tree)

Model 15 : Decision Tree Bagging Classifier - Undersampled Using ClusterCentroids (m15_bagging_estimator)

Model 16 : Random Forest Classifier - Undersampled Using ClusterCentroids (m16_rf_estimator)

Model 17 : AdaBoost Classifier - Undersampled Using ClusterCentroids (m17_adaBoost_classfr)

Model 18 : Gradient Boosting Classifier - Undersampled Using ClusterCentroids (m18_gradBoost_classfr)

Training Performance Comparison: Undersampling Using ClusterCentroids

Testing Performance Comparison: Undersampling Using ClusterCentroids

Insights (Undersampling)

Conclusion: Models trained on undersampled data perform worse than those trained on oversampled data or on the original data. Therefore we will not use the models built on undersampled data.

Best 3 Models

Comparison of the Best Models So Far:

Training Performance Comparison

Testing Performance Comparison

Insights :

Conclusion : 3 Best Models

Hyperparameter Tuning

MODEL 1 : Tuning the Gradient Boosting Classifier without sampling

MODEL 2 : Tuning the Gradient Boosting Classifier with oversampled data

MODEL 3 : Tuning the AdaBoost Classifier (OverSampling)

Tuned Model Performances

Comparing Performance of training data

Comparing Performance of testing data

Insights: Final Model Selection

Pipelines for productionizing the model
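A sketch of such a pipeline, chaining imputation and the final classifier (the model choice and parameters here are placeholders for whatever the tuning step selected):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# The pipeline bundles preprocessing and the model, so raw data can go in directly
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("model", GradientBoostingClassifier(random_state=1)),
])

# Toy numeric data with missing values, standing in for the prepared features
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]] * 5)
y = np.array([0, 1, 0, 1] * 5)
pipe.fit(X, y)
preds = pipe.predict(X)
```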

Feature Importance
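Importances can be read off the fitted tree ensemble and ranked; a sketch on synthetic data (the feature names are placeholders, not the dataset's actual columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=2, random_state=1)
feature_names = [f"feat_{i}" for i in range(6)]  # placeholder names

model = GradientBoostingClassifier(random_state=1).fit(X, y)
# Impurity-based importances, normalized to sum to 1 across features
importances = pd.Series(model.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
```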

Top 5 features to identify customer churn (ranked in decreasing order of importance):

Let's plot the features listed above for existing and churned customers.

Business Recommendation (Based on Feature Importance and EDA):